JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

Duan, Qihao, Huang, Bingding, Song, Zhenqiao, Lehmann, Irina, Gu, Lei, Eils, Roland, Wild, Benjamin

arXiv.org Artificial Intelligence

Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. Yet DNA is inherently bidirectional; for instance, bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only the masked tokens contribute to the loss at each step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention, and Mixture of Experts (MoE) architecture, combining the long-range modeling of Attention with the efficient sequential learning of Mamba; MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single-nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
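
The abstract names three architectural ingredients: a Mamba-style sequential layer, Attention for long-range mixing, and a sparse Mixture of Experts. Below is a minimal PyTorch sketch of how such a hybrid block could be wired together; this is not the authors' architecture, the gated causal convolution is only a cheap stand-in for a real Mamba/SSM layer, and all sizes are illustrative.

import torch
import torch.nn as nn

class SequentialMixer(nn.Module):
    """Gated depthwise causal convolution: a cheap stand-in for a Mamba/SSM layer."""
    def __init__(self, d, kernel=4):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel, groups=d, padding=kernel - 1)
        self.gate = nn.Linear(d, d)

    def forward(self, x):                       # x: (batch, seq, d)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return h * torch.sigmoid(self.gate(x))  # gated sequential mixing

class MoEFeedForward(nn.Module):
    """Top-1 routed experts: capacity grows with expert count, per-token FLOPs do not."""
    def __init__(self, d, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
             for _ in range(num_experts)])

    def forward(self, x):
        scores = self.router(x).softmax(-1)     # (batch, seq, experts)
        top = scores.argmax(-1)                 # hard top-1 routing: sparse activation
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            picked = top == i
            if picked.any():
                out[picked] = expert(x[picked]) * scores[..., i][picked].unsqueeze(-1)
        return out

class HybridBlock(nn.Module):
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.seq_mix = SequentialMixer(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.moe = MoEFeedForward(d)
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = x + self.seq_mix(self.n1(x))                      # cheap sequential mixing
        a, _ = self.attn(self.n2(x), self.n2(x), self.n2(x))  # long-range mixing
        x = x + a
        return x + self.moe(self.n3(x))                       # sparse expert FFN

tokens = torch.randn(2, 1024, 128)  # e.g. 1,024 single-nucleotide embeddings
print(HybridBlock()(tokens).shape)  # torch.Size([2, 1024, 128])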



HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Nguyen, Eric

Neural Information Processing Systems

Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution (i.e., DNA "characters") where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity.
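
The resolution argument is easy to make concrete. In the illustrative sketch below (not code from the paper), a single SNP flips an entire fixed k-mer token into a different vocabulary item, while character-level tokenization localizes the change to one position.

def kmer_tokenize(seq, k=6):
    # non-overlapping fixed k-mers, as used by several earlier genomic LMs
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def char_tokenize(seq):
    return list(seq)  # single-nucleotide resolution: one token per base

ref = "ACGTACGTACGT"
alt = "ACGTACGAACGT"  # one SNP: T -> A at position 7

print(kmer_tokenize(ref))  # ['ACGTAC', 'GTACGT']
print(kmer_tokenize(alt))  # ['ACGTAC', 'GAACGT']  <- an entire token changed

# Character-level tokens differ in exactly one position:
print([i for i, (a, b) in enumerate(zip(char_tokenize(ref), char_tokenize(alt))) if a != b])  # [7]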


Beating Transformers using Synthetic Cognition

Ibias, Alfredo, Rodriguez-Galindo, Miguel, Antona, Hector, Ramirez-Miranda, Guillem, Guinovart, Enric

arXiv.org Artificial Intelligence

The road to Artificial General Intelligence goes through the generation of context-aware reactive behaviors, where the Transformer architecture has proven to be the state of the art. However, Transformers still fail to develop reasoning. Recently, a novel approach to developing cognitive architectures, called Synthetic Cognition, was proposed and implemented to produce instantaneous reactive behavior. In this study, we explore the use of Synthetic Cognition to develop context-aware reactive behaviors. We propose a mechanism for handling sequences in the recent implementation of Synthetic Cognition and test it against DNA foundation models on DNA sequence classification tasks. In our experiments, our proposal clearly outperforms the DNA foundation models, obtaining the best score on more benchmark tasks than the alternatives. We thus achieve two goals: extending Synthetic Cognition to handle sequences, and beating the Transformer architecture at sequence classification.


HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

Yang, Hexiong, Chen, Mingrui, Huang, Huaibo, Duan, Junxian, Cao, Jie, Zhou, Zhen, He, Ran

arXiv.org Artificial Intelligence

Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameter counts, imposing a significant computational burden. To address this, many works have attempted to achieve similar outcomes with more compact models but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameter counts, our model achieved excellent performance. More surprisingly, it even surpassed its teacher model, nominally the distillation ceiling, on some sub-tasks, despite the teacher being more than 500× larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model gains a sophisticated understanding of the intrinsic representation patterns in genomic sequences.
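
One way to read the two-part objective above: a feature-alignment term on visible tokens (distillation against the frozen teacher) plus a standard MLM cross-entropy on masked tokens (reconstruction). The sketch below follows that reading with placeholder tensors; the grouping masking strategy and the actual NTv2-500M teacher are omitted, and the loss form is an assumption rather than the authors' exact formulation.

import torch
import torch.nn.functional as F

def had_style_loss(student_hidden, student_logits, teacher_hidden,
                   target_ids, mask, alpha=1.0):
    # mask is True at MLM positions hidden from the student
    visible = ~mask
    # 1) distillation: match the teacher's feature embeddings on visible tokens
    distill = F.mse_loss(student_hidden[visible], teacher_hidden[visible])
    # 2) reconstruction: usual MLM cross-entropy on masked tokens
    recon = F.cross_entropy(student_logits[mask], target_ids[mask])
    return recon + alpha * distill

# Toy shapes: batch 2, sequence 16, hidden 32, vocab 8
B, L, D, V = 2, 16, 32, 8
mask = torch.rand(B, L) < 0.15  # stand-in for the grouping masking strategy
loss = had_style_loss(torch.randn(B, L, D), torch.randn(B, L, V),
                      torch.randn(B, L, D), torch.randint(0, V, (B, L)), mask)
print(loss)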


A Phylogenetic Approach to Genomic Language Modeling

Albors, Carlos, Li, Jianan Canal, Benegas, Gonzalo, Ye, Chengzhong, Song, Yun S.

arXiv.org Artificial Intelligence

Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
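
As a deliberately simplified illustration of the training setup (the paper models nucleotide evolution on a phylogenetic tree; the sketch below substitutes raw alignment-column frequencies as soft targets), the loss consumes a multispecies alignment while the model itself only ever sees a single sequence, so inference needs no alignment at all.

import torch
import torch.nn.functional as F

BASES = "ACGT"

def column_target(column):
    """Soft target from one alignment column, e.g. ['A', 'A', 'A', 'G']."""
    counts = torch.tensor([column.count(b) for b in BASES], dtype=torch.float)
    return counts / counts.sum()

def phylo_aware_loss(logits, columns):
    # logits: (positions, 4) predictions from a single-sequence model
    targets = torch.stack([column_target(c) for c in columns])
    return F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")

# A conserved column pushes toward a peaked prediction; a variable one tolerates spread.
logits = torch.randn(2, 4)
print(phylo_aware_loss(logits, [list("AAAA"), list("ACGT")]))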


Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Qiao, Lifeng, Ye, Peng, Ren, Yuchen, Bai, Weiqiang, Liang, Chaoqi, Ma, Xinzhu, Dong, Nanqing, Ouyang, Wanli

arXiv.org Artificial Intelligence

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, explicitly accounting for the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments. On the Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. MxDNA aims to provide a new perspective on DNA tokenization, potentially offering broad applications in various domains and yielding profound insights.
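
As a rough illustration of the mixture-of-convolution-experts idea (hypothetical code: the deformable convolution and the sparse routing details are omitted), each position routes among convolution experts with different kernel sizes, i.e., different candidate token granularities.

import torch
import torch.nn as nn

class ConvExpertMixture(nn.Module):
    def __init__(self, d=64, kernel_sizes=(1, 3, 6)):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv1d(d, d, k, padding=k // 2) for k in kernel_sizes])
        self.router = nn.Linear(d, len(kernel_sizes))

    def forward(self, x):                     # x: (batch, seq, d)
        weights = self.router(x).softmax(-1)  # per-position expert weights
        h = x.transpose(1, 2)                 # (batch, d, seq) for Conv1d
        outs = [e(h)[..., : x.size(1)].transpose(1, 2) for e in self.experts]
        # soft mixture for simplicity; a sparse top-1 choice would zero the rest
        return sum(w.unsqueeze(-1) * o for w, o in zip(weights.unbind(-1), outs))

x = torch.randn(2, 100, 64)          # 100 nucleotide embeddings
print(ConvExpertMixture()(x).shape)  # torch.Size([2, 100, 64])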


Adversarial Examples for DNA Classification

Yoo, Hyunwoo

arXiv.org Artificial Intelligence

Pre-trained language models such as DNABERT2 and Nucleotide Transformer, which are trained on DNA sequences, have shown promising performance in DNA sequence classification tasks. The classification ability of these models stems from language models trained on vast amounts of DNA sequence samples, followed by fine-tuning on relatively smaller classification datasets. However, these text-based systems are not robust and can be vulnerable to adversarial examples. While adversarial attacks have been widely studied in text classification, there is limited research on DNA sequence classification. In this paper, we adapt attack algorithms commonly used in text classification to DNA sequence classification. We evaluate the impact of various attack methods on DNA sequence classification at the character, word, and sentence levels. Our findings indicate that real DNA language model classifiers are vulnerable to these attacks.
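
In its simplest character-level form, such an attack reduces to single-nucleotide substitutions chosen to hurt the classifier. The toy sketch below (hypothetical interface, not the paper's attack code) greedily keeps whichever substitution most reduces the model's confidence in the true class.

BASES = "ACGT"

def greedy_char_attack(seq, true_class_proba, budget=3):
    """true_class_proba(seq) -> model's probability for the true label."""
    seq = list(seq)
    for _ in range(budget):
        best = (true_class_proba("".join(seq)), None, None)
        for i in range(len(seq)):
            for b in BASES:
                if b == seq[i]:
                    continue
                p = true_class_proba("".join(seq[:i] + [b] + seq[i + 1:]))
                if p < best[0]:
                    best = (p, i, b)
        if best[1] is None:  # no single edit lowers confidence further
            break
        seq[best[1]] = best[2]
    return "".join(seq)

# Toy stand-in classifier: "confidence" is just the fraction of G bases
print(greedy_char_attack("GGGGAC", lambda s: s.count("G") / len(s)))  # AAAGAC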